Reproducibility of CT-based opportunistic vertebral volumetric bone mineral density measurements from an automated segmentation framework

Background To investigate the reproducibility of automated volumetric bone mineral density (vBMD) measurements from routine thoracoabdominal computed tomography (CT), assessed with segmentations by a convolutional neural network and automated correction of contrast phases, on diverse scanners, with scanner-specific asynchronous or scanner-agnostic calibrations.
Methods We obtained 679 observations from 278 CT scans in 121 patients (77 males, 63.6%) studied from 04/2019 to 06/2020. Observations consisted of two vBMD measurements from Δdifferent reconstruction kernels (n = 169), Δcontrast phases (n = 133), scan Δsessions (n = 123), Δscanners (n = 63), or Δall of the aforementioned (n = 20), and observations lacking scanner-specific calibration (n = 171). Precision was assessed using root-mean-square error (RMSE) and root-mean-square coefficient of variation (RMSCV). Cross-measurement agreement was assessed using Bland-Altman plots; outliers exceeding the inner bounds of the 95% confidence intervals of the limits of agreement were reviewed.
Results Repeated measurements from Δdifferent reconstruction kernels were highly precise (RMSE 3.0 mg/cm³; RMSCV 1.3%), even for consecutive scans with different Δcontrast phases (RMSCV 2.9%). Measurements from different Δscan sessions or Δscanners showed decreased precision (RMSCV 4.7% and 4.9%, respectively). Plot review identified 12 outliers from different scan Δsessions, with signs of hydropic decompensation. Observations with Δall differences showed decreased precision compared to those lacking scanner-specific calibration (RMSCV 5.9% and 3.7%, respectively).
Conclusion Automatic vBMD assessment from routine CT is precise across varying setups when calibrated appropriately. Low precision was found in patients with signs of new or worsening hydropic decompensation, which should be considered an exclusion criterion for both opportunistic and dedicated quantitative CT.
Relevance statement Automated CT-based vBMD measurements are precise in various scenarios, including cross-session and cross-scanner settings, and may therefore facilitate opportunistic screening for osteoporosis and surveillance of BMD in patients undergoing routine clinical CT scans.
Key Points
• Artificial intelligence-based tools facilitate BMD measurements in routine clinical CT datasets.
• Automated BMD measurements are highly reproducible in various settings.
• Reliable, automated opportunistic osteoporosis diagnostics allow for large-scale application.
Graphical Abstract

• Intra- and inter-scan as well as inter-scanner reproducibility of volumetric bone mineral density (BMD) measurements was assessed, resulting in 679 observations from 278 CT scans.
• A high precision across different reconstruction kernels and contrast phases was shown.
• Inter-scanner reproducibility was lower in patients with new or worsening signs of hydropic decompensation.
Automated volumetric BMD accurately measures bone density in routine CT scans enabling opportunistic osteoporosis screening.

Background
Osteoporosis is a systemic disease that destabilizes the bone by demineralization of the osseous tissue and deterioration of the trabecular microstructure [1,2]. The diagnosis is frequently delayed because patients remain symptom-free until fragility fractures occur. These occur in the absence of adequate trauma but entail significant morbidity, mortality, and enormous socioeconomic consequences [3]. Demographic change aggravates this issue.
In the USA, approximately 54 million people over the age of 50 were estimated to suffer from low bone mass or osteoporosis by 2010, and the number is projected to reach 71 million by 2030 [4]. Dual x-ray absorptiometry and quantitative computed tomography (CT) are available to screen for and diagnose osteoporosis [5,6]. However, osteoporosis remains vastly underdiagnosed [2,6,7]. Part of the problem is that both methods rely on specialized tools (e.g., calibration phantoms) and, importantly, necessitate a dedicated exam, which entails substantial organizational effort.
Opportunistic approaches aim to overcome these limitations by determining bone mineral density (BMD) from exams performed for other indications [8]. The abundance of CT data underscores the possible impact of CT-based opportunistic BMD screening approaches: annual scans in the USA surpassed 278 per 1,000 inhabitants by 2019 [9]. However, the need for exact manual segmentations to determine trabecular volumetric BMD (vBMD) from routine clinical multidetector CT scans has so far limited the viability of this approach.
Deep learning-based convolutional neural network frameworks have recently been developed to master this challenge [10]. Such neural networks automatically perform all steps of vertebral body segmentation. The asynchronous conversion of CT-based density values, measured in Hounsfield units (HU), into vBMD and corrections for the intravenous contrast media phase can also be performed automatically [11-13]. However, to render this approach suitable for mass application, its reproducibility has yet to be determined.
Thus, this study aimed to provide comprehensive information on the reproducibility of vBMD measurements performed by a fully automated convolutional neural network framework, utilizing routine clinical thoracoabdominal CT, obtained in different contrast media phases on a diverse set of scanner systems, with varying settings, asynchronously calibrated as well as using a manufacturer-generic, kVp-based calibration.

Study population
Patients who received at least two consecutive routine thoracoabdominal CT scans with an interscan interval of up to one month and matching regions of interest between April 2019 and June 2020 were identified from the local picture archiving and communication system. The maximum interscan interval of one month was selected to maximize the number of patients eligible for inclusion, while simultaneously aiming to rule out longitudinal changes in vBMD, which have been reported to reach 2% per year in a healthy cohort [14]. To cover the widest possible range of CT scanning systems in this study, scans that were performed at other institutions and imported into our system were also included. All scans were manually checked for lumbar spine coverage by a neuroradiology resident (P.P., 3 years of experience in spine imaging). Scans not covering the lumbar spine, as well as scans with severe beam hardening artifacts or high noise level at the lumbar spine (e.g., due to implants or other foreign material), were excluded. Further exclusion criteria were inflammatory and neoplastic lesions at the lumbar spine.
In-house scans were obtained on a set of four scanners (Philips Brilliance iCT 256, Philips IQon Spectral CT, and Philips Ingenuity, Philips Medical Systems, Hamburg, Germany; Siemens Somatom Definition AS+, Siemens Healthineers, Erlangen, Germany). External scans were performed on one of eight scanner types (Canon Aquilion and Canon Aquilion PRIME, Canon Medical Systems, Amstelveen, Netherlands; Siemens Biograph, Siemens Somatom Emotion 16, Siemens Somatom Definition AS, Siemens Somatom Force, and Siemens Somatom Emotion 6, Siemens Healthineers, Erlangen, Germany; and Philips Ingenuity Core 128, Philips Medical Systems, Hamburg, Germany).
All reformations with a spatial resolution of ≤ 3 mm craniocaudally and of 5 mm left-right or anterior-posterior were included. In-house scanners were asynchronously calibrated using a commercially available anthropomorphic spine phantom (QRM QSA-717 Phantom; Quality Assurance in Radiology and Medicine GmbH, Möhrendorf, Germany).
In all available reconstructions, spine detection, vertebral labelling, and trabecular compartment segmentation were performed automatically (SpineQ, version 1.0, Bonescreen GmbH, Munich, Germany, Fig. 1). HU-to-BMD calibration was performed automatically using linear conversion factors based on kVp and scanner type. Automated correction for the contrast media phase was performed using a two-dimensional DenseNet model [12]. Mean trabecular vBMD was calculated across the L1-4 lumbar vertebrae (vBMD L1-4) for each observation. Vertebrae considered unmeasurable according to the ACR criteria [15] were excluded from this mean on a per-patient basis. Specifically, vertebrae with fractures or Modic type 3 changes affecting > 25% of the vertebra were excluded. For quality assurance, a neuroradiologist trained in spine imaging (P.P., 3 years of experience) reviewed vertebral body segmentation and automatic contrast media phase detection and supervised the inclusion and exclusion of vertebrae, documenting any manual corrections, if necessary.
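The linear HU-to-BMD conversion and the averaging across L1-4 described above can be illustrated with a minimal Python sketch. It is not the SpineQ implementation: the coefficient tables, the scanner name "ScannerA", and all numeric values are hypothetical placeholders, and the contrast phase correction is omitted.

```python
from statistics import mean

# Hypothetical linear calibration coefficients (slope, intercept), keyed by
# (scanner type, kVp). Placeholder numbers, not values from the study.
SCANNER_SPECIFIC = {("ScannerA", 120): (0.80, 5.0)}
# Hypothetical scanner-agnostic fallback, keyed by kVp only.
KVP_GENERIC = {120: (0.78, 4.0)}


def hu_to_vbmd(hu, scanner, kvp):
    """Convert a trabecular HU value to vBMD (mg CaHA/cm^3) with a linear map."""
    # Use the scanner-specific factors if available, else the kVp-only ones.
    slope, intercept = SCANNER_SPECIFIC.get((scanner, kvp), KVP_GENERIC[kvp])
    return slope * hu + intercept


def mean_vbmd_l1_4(hu_by_level, scanner, kvp, excluded=()):
    """Average vBMD over L1-L4, skipping vertebrae marked as unmeasurable."""
    levels = [lv for lv in ("L1", "L2", "L3", "L4")
              if lv in hu_by_level and lv not in excluded]
    if not levels:
        raise ValueError("no measurable vertebra among L1-L4")
    return mean(hu_to_vbmd(hu_by_level[lv], scanner, kvp) for lv in levels)
```

In this sketch, a scanner without an entry in the scanner-specific table falls back to the kVp-only factors, which loosely mirrors the scanner-agnostic calibration used for uncalibrated scanners.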

Group definitions
Acquisition parameters such as kVp, reconstruction kernel, and slice thickness, as well as scanner model, scan positioning, intravenous contrast media phase, and calibration status are well-known confounders of vBMD measurements [12,16-19]. To quantify each factor's impact on reproducibility, we defined a set of six groups with an expected increase in variance. Observations were assigned to groups based on the following criteria:
I. Δrecon: both measurements derived from a single acquisition but using different reconstruction kernels and/or slice thicknesses;
II. Δcontrast: measurements obtained from different contrast media phases, obtained in consecutive acquisitions during a single scan session;
III. Δsession: measurements derived from different scanning sessions at a single calibrated scanner, but both scans had the same contrast media phase;
IV. Δscanner: measurements assessed in the same contrast media phase, but at two different calibrated scanners;
V. Δall: measurements obtained from two different scanners, in different contrast media phases;
VI. scanner-agnostic: measurements obtained from datasets from two different scanners, at least one of which was not asynchronously calibrated, and only a kVp-specific, but scanner-independent, calibration was applied.
Groups I-V consisted of observations from asynchronously calibrated scanners only. Reconstruction kernels, slice thicknesses, and reconstruction planes were not controlled for and varied randomly as per acquisition protocol.
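As an illustration of these criteria, the following Python sketch assigns a pair of measurements to one of the six groups. The `Measurement` record and all of its field names are hypothetical and only mirror the attributes named in the group definitions; pairs that match none of the definitions return `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    # All field names are hypothetical; they only mirror the criteria in the text.
    acquisition_id: str   # one CT acquisition (may yield several reconstructions)
    session_id: str       # one scanning session (assumed unique across scanners)
    scanner_id: str
    contrast_phase: str   # e.g. "native", "arterial", "portal-venous"
    calibrated: bool      # scanner-specific asynchronous calibration available


def assign_group(a: Measurement, b: Measurement) -> Optional[str]:
    """Return the group label for a measurement pair, or None if no group fits."""
    same_scanner = a.scanner_id == b.scanner_id
    same_phase = a.contrast_phase == b.contrast_phase
    if not (a.calibrated and b.calibrated):
        # Group VI: two different scanners, at least one not calibrated.
        return "scanner-agnostic" if not same_scanner else None
    if a.acquisition_id == b.acquisition_id:
        return "Δrecon"            # Group I: same acquisition, other reconstruction
    if a.session_id == b.session_id:
        return "Δcontrast" if not same_phase else None   # Group II
    if same_scanner:
        return "Δsession" if same_phase else None         # Group III
    return "Δscanner" if same_phase else "Δall"           # Groups IV and V
```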

Statistical analysis
All statistical analyses were performed using STATA software version 13.1 (StataCorp LLC, College Station, Texas, USA). For each observation, the absolute interscan difference in vBMD L1-4 was calculated by subtracting vBMD L1-4 measurement 2 from vBMD L1-4 measurement 1. Relative interscan differences were calculated as percent gain or loss in vBMD L1-4 between measurement 1 and measurement 2, in all observations. Mean absolute and relative vBMD differences and their respective standard deviations were calculated groupwise (I-IV).
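A minimal Python sketch of these per-observation differences follows; the original analysis was done in STATA. The function names are illustrative, and expressing the percent change relative to measurement 1 is an assumption, since the text does not state the denominator.

```python
from statistics import mean, stdev


def interscan_differences(pairs):
    """pairs: iterable of (measurement_1, measurement_2) vBMD values."""
    pairs = list(pairs)
    # Absolute difference: measurement 2 subtracted from measurement 1.
    absolute = [m1 - m2 for m1, m2 in pairs]
    # Assumption: percent change is expressed relative to measurement 1.
    relative = [100.0 * (m1 - m2) / m1 for m1, m2 in pairs]
    return absolute, relative


def summarize(values):
    """Group-wise mean and sample standard deviation of the differences."""
    return mean(values), stdev(values)
```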
Root mean square error (RMSE) and root mean square coefficients of variation (RMSCV) were calculated as measures of variance for each group. To further investigate the agreement of both measurements at the single-observation level, Bland-Altman plots were created, including 95% confidence intervals (95% CIs) of the limits of agreement (mean difference ± 1.96 × standard deviation of the difference). The statistical significance of group differences in coefficients of variation was assessed using unifactorial ANOVA and Tukey post-hoc test.
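The text does not spell out the formulas behind RMSE and RMSCV. The Python sketch below uses one common convention for paired measurements and should be read as an assumption: RMSE is taken as the root mean square of the pairwise differences, and RMSCV as the root mean square of the per-pair coefficients of variation, where the standard deviation of two values equals their absolute difference divided by the square root of 2.

```python
from math import sqrt


def rmse(pairs):
    """Root mean square of the pairwise differences (same unit as the input)."""
    pairs = list(pairs)
    return sqrt(sum((m1 - m2) ** 2 for m1, m2 in pairs) / len(pairs))


def rmscv_percent(pairs):
    """Root mean square of the per-pair coefficients of variation, in percent."""
    pairs = list(pairs)
    squared_cvs = []
    for m1, m2 in pairs:
        pair_sd = abs(m1 - m2) / sqrt(2)   # sample SD of two values
        pair_mean = (m1 + m2) / 2
        squared_cvs.append((pair_sd / pair_mean) ** 2)
    return 100.0 * sqrt(sum(squared_cvs) / len(squared_cvs))
```

Other conventions exist, for example dividing the squared differences by 2n for the absolute precision error, so values computed this way need not match the published table exactly.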
Observations with poor agreement, defined as an absolute difference exceeding the inner boundaries of the 95% CIs of the limits of agreement, were retrospectively reviewed by an experienced neuroradiologist specializing in spine imaging (J.S.K., 22 years of experience), and possible factors influencing vBMD measurements were recorded manually.
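The outlier rule can be sketched in Python as follows. How the confidence intervals of the limits of agreement were computed is not stated in the text; this sketch assumes the approximate standard error sqrt(3·s²/n) proposed by Bland and Altman and a normal quantile in place of the t quantile. All function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def bland_altman(diffs, level=0.95):
    """Bias, limits of agreement, and the CI half-width of those limits."""
    n = len(diffs)
    bias = mean(diffs)
    sd = stdev(diffs)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    # Assumption: approximate standard error of a limit of agreement.
    margin = z * sqrt(3 * sd ** 2 / n)
    return {"bias": bias, "lower": bias - z * sd,
            "upper": bias + z * sd, "margin": margin}


def flag_outliers(diffs):
    """Indices of differences beyond the inner CI bounds of the limits."""
    ba = bland_altman(diffs)
    inner_upper = ba["upper"] - ba["margin"]   # lower CI bound of the upper limit
    inner_lower = ba["lower"] + ba["margin"]   # upper CI bound of the lower limit
    return [i for i, d in enumerate(diffs)
            if d > inner_upper or d < inner_lower]
```

Using the inner bounds makes the rule more inclusive than the limits of agreement themselves, so borderline observations are also sent to manual review.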

Reproducibility measurements
As expected, RMSCV values increased along the groups from a minimum of 1.3% in measurements obtained from identical scans (Δrecon) to a maximum of 5.9% in the Δall cohort (Table 2). The RMSCV increased significantly between Δrecon and Δcontrast (p < 0.001) and between Δcontrast and Δsession (p < 0.001), while the increases between Δsession and Δscanner as well as between Δscanner and Δall were not statistically significant (p ≥ 0.770 for both). The RMSCV difference between Δall and scanner-agnostic narrowly missed statistical significance (p = 0.054). As reflected in the RMSCV, vBMD L1-4 values for both measurements in the Δrecon group (n = 169) were extremely similar, with an average absolute difference of 0.1 ± 3.1 mg CaHA/cm³ and a relative difference of 0.1 ± 2.3%. Consequently, the group showed an overall low absolute error with an RMSE of 3.1 mg CaHA/cm³.
The relative error increased further in Δall observations, to 5.9%. Notably, measurements in scanner-agnostic scans showed acceptable absolute (2.0 ± 10.0 mg CaHA/cm³) and relative (2.0 ± 5.9%) differences, reflected in the mean absolute (RMSE = 9.6 mg CaHA/cm³) and relative error (RMSCV = 4.2%). Bland-Altman plot analysis showed good agreement for all groups except Δsession. In this group, 12 observations exceeded the lower 95% CI of the upper limit of agreement (25.5 mg CaHA/cm³) or the upper 95% CI of the lower limit of agreement (-20.15 mg CaHA/cm³) (Fig. 4). Manual review showed that all those patients were scanned twice within a short period due to severe illness with signs of hydropic decompensation, resulting in new or increasing pleural effusion (n = 11), anasarca (n = 10), mesenterial fluid injection or ascites (n = 5), and pulmonary septal thickening (n = 3). Two of the patients were intubated between baseline and follow-up and had received new abdominal drains (Fig. 5). Exclusion of the identified outliers resulted in lower value dispersion and in improved absolute and relative precision errors (Δsession_no-outliers: n = 111; absolute difference = 0.6 ± 9.5 mg CaHA/cm³; relative difference = 0.3 ± 5.9%; RMSE = 9.0 mg CaHA/cm³; RMSCV = 4.1%).

Discussion
This study investigated the reproducibility of vBMD assessments from routine clinical CT scans using a fully automated convolutional neural network framework in various settings, including cross-scanner and cross-center settings. Reproducibility was excellent for all comparisons derived from a single scanning session. This demonstrates that neither the input convolution kernel nor slice orientation or thickness diminishes the reproducibility of this approach, and that the contrast media phase can be effectively corrected for. Precision errors for measurements derived from two different scanning sessions or different scanners were higher, but acceptable. While measurement errors may be due in part to changes in patient positioning, we also found evidence that they may be driven up by short-term changes in the tissue water content of severely ill patients, and we recommend introducing hydropic decompensation as a general exclusion criterion for quantitative CT measurements.
The Δrecon group showed excellent reproducibility of vBMD L1-4 measurements with precision errors of approximately 1.5% or 3 mg CaHA/cm³. Similar precision has been shown for dual X-ray absorptiometry in repeated measurements without and with repositioning [20-22]. Since the two measurements in this group were derived from different reformations of a single scan, the precision error is most likely attributable to slight differences in the segmentation process caused by the different reconstruction kernels, slice orientations, and thicknesses, resulting in slightly different mean HU values or volume-of-interest placements by the convolutional Bonescreen neural network framework [10,23]. Overall, the findings underline the previously reported robustness of the automated segmentation algorithm and HU-to-vBMD conversion following asynchronous calibration [10,21,24].
The influence of the contrast media phase on vBMD measurements is well known, especially in setups with external calibration, as the contrast medium augments the attenuation of vascularized body parts but does not influence the reference values obtained in the external phantom [8,25,26]. Therefore, a set of studies investigated correction methods for the contrast phase, and Rühling et al. recently developed an automated model for contrast media phase detection and correction in a single-scanner setting [12,25,27,28]. The study reported precision errors of 9.5 mg CaHA/cm³ in arterial and 4.0 mg CaHA/cm³ in portal-venous phase [12]. Contrast phase correction using phase-dependent correction factors performed similarly well in the current dataset, derived from various scanners, with absolute and relative precision errors of 6.1 mg CaHA/cm³ and 2.9% in the Δcontrast group. Moreover, precision errors increased only slightly in the Δall group, compared to the Δsession and Δscanner groups. This indicates that the implemented contrast media phase correction may work similarly well in cross-scanner comparisons. Modern advancements in CT, like spectral imaging, can further minimize the impact of intravenous contrast or other foreign materials on vBMD measurements and have been shown to yield promising results for vBMD measurements and in fracture prediction with good reproducibility [29-31]. While there is barely any use case for spectral CT in mass opportunistic osteoporosis screening due to the limited scanner availability, it may prove to be a pivotal advancement in osteoporosis diagnostics in the future.
Patient positioning and table height are known to severely impact attenuation values in multidetector CT due to x-ray field inhomogeneities and beam hardening effects being highly dependent on the scanner's isocenter, and thereby also affect vBMD measurements [16-19]. This may in part explain the higher RMSCV and RMSE values in the Δsession, Δscanner, and Δall groups, as all observation pairs in these groups were obtained in different scanning sessions and, to a certain extent, differences in patient positioning and table height can be expected between sessions. To counter this problem, internal calibration has been proposed as an alternative calibration method. Internal calibration uses tissue-specific HU values, e.g., fat and muscle, to calculate a scan-specific conversion factor from HU to BMD [32]. While the method showed promising results in the past, studies have found that it is not superior to asynchronous calibration [27].
Fig. 3 a Sagittal reformations of two thoracolumbar CT scans (left, unenhanced; right, 90 s after intravenous contrast media administration). Scans were obtained for search of an endoleak of the aortic prosthesis (arrowhead). Annotations show HU of the abdominal aorta and the L3 vertebral body. Measured bone mineral density was 96 mg CaHA/cm³ in the unenhanced scan and 98 mg CaHA/cm³ in portal-venous phase. b Axial reformations of two unenhanced CT scans of a single patient obtained 20 days apart on the same scanner (Philips IQon). Images show cross-sections at the L3 level. Blue circles represent the maximum field of view. Blue lines intersect at the scanner center. The substantial difference in patient placement and distance between the lumbar spine and the scanner center between scans is evident and may explain substantial differences in bone mineral density measurements (top, 207.7 mg CaHA/cm³; bottom, 227.2 mg CaHA/cm³). Of note, the bottom scan was obtained following oral contrast administration
Δsession_no-outliers comprises the Δsession group following exclusion of outliers, which exceeded the inner boundaries of the 95% confidence interval limit of agreement of the Bland-Altman plot in Fig. 4
Across groups with patient repositioning, retrospective examination of cases with the greatest vBMD variability revealed that the measurements were partially derived from severely ill patients from the intensive care units of our hospital, who had fluctuating levels of intraabdominal as well as interstitial fluid and pleural effusion between scans. This bias is particularly difficult to correct for, since hydration status in intensive care unit patients may fluctuate substantially from day to day. We regard this finding as particularly important, as hydration status is not a reported confounder for quantitative CT measurements, which thus ought to be critically appraised in severely ill patients. However, since our study was not specifically designed to investigate this topic, it warrants further investigation.
Data on cross-session reproducibility for asynchronous vBMD measurements are scarce, even more so for cross-scanner settings. Previous reports of reproducibility measures for asynchronous quantitative CT in single-scanner settings showed precision errors of 3-4 mg CaHA/cm³ or 2.2-3.7% [33,34]. Sollmann et al. compared opportunistically assessed vBMD from a set of six different scanners with QCT and documented different degrees of variation per scanner, an approach that seems somewhat comparable to cross-scanner results [11]. However, the authors did not measure cross-scanner reproducibility directly. Considering the possible bias of the hydration status, precision errors across Δsession, Δscanner, and Δall may be regarded as acceptable, with a maximum relative error of 5.9% in the Δall group and a maximum absolute error of 11 mg CaHA/cm³ in Δsession. Scanner-agnostic observation pairs showed similar reproducibility to asynchronously calibrated measurements from different sessions and yielded better results than Δall, although this difference narrowly missed statistical significance. In fact, both the RMSE of 10.1 mg CaHA/cm³ and the RMSCV of 3.7% were slightly lower in scanner-agnostic compared to the Δsession group, and the RMSCV was markedly lower in scanner-agnostic compared to Δall. Both results suggest that scanner-specific phantom measurements may not necessarily be needed for asynchronous calibration if kVp-specific calibration factors are available for the scanner type.
We acknowledge some limitations of our study. First, we extended the generalizability of our results by including several scanners from outside of our institution. However, despite our efforts, we did not achieve an equal distribution across all main vendors. This may have an adverse impact on the external validity of the determined reproducibility. From our point of view, this issue can only be solved in multicentric studies, preferably with scanner-specific asynchronous calibration. Nonetheless, we demonstrated that even scanner-agnostic kVp-based calibration yields acceptable reproducibility, regardless of the scanner combination. Second, cross-session reproducibility was limited, as we included many severely ill patients. It remains unclear whether the observed increases in precision errors were attributable to patient positioning or caused by pathophysiological changes to the body composition, such as changes in hydration status. Since this issue has not been reported in the literature and this study was not designed to investigate it, further investigation is necessary. However, this problem appears difficult to address, as it seems to be inherent in the typical design of this type of study: healthy individuals would rarely receive two thoracoabdominal CT scans within a single month.
To summarize, the automatic vBMD measurements by a convolutional neural network-based tool with asynchronous calibration and automated correction for the contrast media phase investigated in this study showed good reproducibility.The slightly lower precision in cross-session and cross-scanner settings may be related to patient positioning, but also short-term changes to the patients' body compositions, necessitating further investigations.Notably, precision was similar in cross-session settings and the group without scanner-dedicated, asynchronous phantom-based calibration.Patient positioning and body composition may thus be of interest as major determinants of reproducibility for further studies.

Fig. 1 Steps of the automated segmentation by Bonescreen. a Vertebral body detection and labeling. Vertebral segmentation (b, sagittal view; c, coronal view), including posterior elements (d). e Identification of cortical and trabecular bone. f Three-dimensional model of segmented vertebrae

Fig. 4 a-f Bland-Altman plots visualizing agreement of measurements on a per-observation basis, in each group. For each observation (grey dots), the difference between measurement 1 and measurement 2 is plotted against the group mean. The group mean is indicated by the short-dashed line, while the long-dashed lines indicate the limits of agreement (± 2 standard deviations [SD]) (dotted lines)

Fig. 5 a Axial reformations of two scans of the same patient, obtained 16 days apart at the L2 level (top: baseline; bottom: follow-up). Derived bone mineral density measurements changed significantly between scans (baseline: 148.3 mg CaHA/cm³; follow-up: 175.1 mg CaHA/cm³). Manual case review revealed that the patient suffered multiple intraabdominal abscesses between scans. Subcutaneous fat HU increased from -25 to +2 between scans as a sign of hydropic decompensation. Also note the progressive mesenterial fluid injections and paracolic ascites. b Imaging at the L1 level revealed pleural effusion in the follow-up scan (bottom) of this patient, 19 days after baseline (top). The patient also showed an increase of subcutaneous fat attenuation from -83 HU at baseline to -59 HU at follow-up, co-occurring with an increase in measured bone mineral density (baseline: 111.4 mg CaHA/cm³; follow-up: 133.1 mg CaHA/cm³)

Table 1
Cohort demographics. a Values based on a number of observations

Table 2
Reproducibility of fully automated vBMD L1-4 measurements in each group. a Group contains asynchronously calibrated measurements only. vBMD Volumetric bone mineral density, CaHA Calcium hydroxylapatite, RMSE Root mean square error, RMSCV Root mean square coefficient of variation.